fix: Add support for unsigned Arrow datatypes in schema conversion #1617
base: main
Conversation
Sorry, new to reviewing (and mostly new to the code base), so take these comments with a grain of salt, but this approach seems brittle:
It seems a more robust solution would be to:
I have the same concern as @emkornfield. Also, the Python implementation can serve as a good reference. Note that PyIceberg uses bit width while arrow-rs only provides the width in bytes (primitive_width()).
@emkornfield Right, the current approach has the potential for silent data corruption because of the dependency on Arrow's doc field. @CTTY Thanks for the references, I will use them for the casting. My updated approach uses safe bit-width casting for unsigned integer types, following the proven iceberg-python implementation:
• uint8/uint16 → int32: safe upcast with no overflow risk
• uint32 → int64: safe upcast with no overflow risk
• uint64: rejected with a clear error, since no signed Iceberg type can hold its full range
Let me know if there are any concerns; otherwise I will have the changes out.
crates/iceberg/src/arrow/schema.rs
Outdated
// Cast unsigned types based on bit width (following Python implementation)
DataType::UInt8 | DataType::UInt16 | DataType::UInt32 => {
    // Cast to next larger signed type to prevent overflow
    let bit_width = p.primitive_width().unwrap_or(0) * 8; // Convert bytes to bits
This seems superfluous, can't you just match on the data types and directly map them?
DataType::UInt8 | DataType::UInt16 => Ok(Type::Primitive(PrimitiveType::Int)),
DataType::UInt32 => Ok(Type::Primitive(PrimitiveType::Long)),
Yes, will simplify this logic.
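For reference, a self-contained sketch of what the simplified mapping could look like. This is not the PR's exact code: the function name is invented for illustration, and the UInt64 arm and its error message are assumptions based on the discussion below about failing cleanly.

```rust
use arrow_schema::DataType;
use iceberg::spec::{PrimitiveType, Type};
use iceberg::{Error, ErrorKind, Result};

/// Map an unsigned Arrow integer type to the smallest signed Iceberg type
/// that can hold its full value range; UInt64 has none, so it is rejected.
fn unsigned_to_iceberg(dt: &DataType) -> Result<Type> {
    match dt {
        // u8/u16 values always fit in a signed 32-bit int.
        DataType::UInt8 | DataType::UInt16 => Ok(Type::Primitive(PrimitiveType::Int)),
        // u32 values always fit in a signed 64-bit long.
        DataType::UInt32 => Ok(Type::Primitive(PrimitiveType::Long)),
        // u64 has no lossless signed counterpart; fail loudly instead of
        // silently corrupting data.
        DataType::UInt64 => Err(Error::new(
            ErrorKind::DataInvalid,
            "Unsupported Arrow data type: UInt64",
        )),
        other => Err(Error::new(
            ErrorKind::DataInvalid,
            format!("Not an unsigned integer type: {other}"),
        )),
    }
}
```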
crates/iceberg/src/arrow/schema.rs
Outdated
@@ -378,7 +378,24 @@ impl ArrowSchemaVisitor for ArrowSchemaConverter {
DataType::Int8 | DataType::Int16 | DataType::Int32 => {
    Ok(Type::Primitive(PrimitiveType::Int))
}
// Cast unsigned types based on bit width (following Python implementation)
Suggested change:
- // Cast unsigned types based on bit width (following Python implementation)
+ // Cast unsigned types based on bit width to allow for no data loss
I'm not sure Python compatibility is a direct goal here?
Will incorporate this
crates/iceberg/src/arrow/schema.rs
Outdated
// Test UInt8/UInt16 → Int32 casting
{
    let arrow_field = Field::new("test", DataType::UInt8, false).with_metadata(
nit: this doesn't look like it's testing UInt16?
I have added this scenario
crates/iceberg/src/arrow/schema.rs
Outdated
@@ -1717,6 +1734,60 @@ mod tests {
        }
    }

    #[test]
    fn test_unsigned_type_casting() {
        // Test UInt32 → Int64 casting
is it possible to maybe parameterize the non-error cases, at least with expected input/output, to avoid boilerplate?
Done
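As an illustration, a sketch of what such a parameterized test could look like, assuming the crate's public arrow_schema_to_schema helper and the PARQUET:field_id metadata key; the helper names and assertion details in the actual PR may differ:

```rust
use std::collections::HashMap;

use arrow_schema::{DataType, Field, Schema as ArrowSchema};
use iceberg::arrow::arrow_schema_to_schema;
use iceberg::spec::{PrimitiveType, Type};

#[test]
fn test_unsigned_type_conversion_parameterized() {
    // (input Arrow type, expected Iceberg primitive) pairs for the non-error casts.
    let cases = [
        (DataType::UInt8, PrimitiveType::Int),
        (DataType::UInt16, PrimitiveType::Int),
        (DataType::UInt32, PrimitiveType::Long),
    ];
    for (arrow_type, expected) in cases {
        // iceberg-rust resolves field ids from this Parquet metadata key.
        let field = Field::new("test", arrow_type, false).with_metadata(HashMap::from([(
            "PARQUET:field_id".to_string(),
            "1".to_string(),
        )]));
        let arrow_schema = ArrowSchema::new(vec![field]);
        let iceberg_schema = arrow_schema_to_schema(&arrow_schema).unwrap();
        let converted = iceberg_schema.field_by_id(1).unwrap().field_type.clone();
        assert_eq!(*converted, Type::Primitive(expected));
    }
}
```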
@@ -1717,6 +1734,60 @@ mod tests {
        }
    }

    #[test]
this probably isn't the right module, but it would be nice to have a test that actually exercises writing these types and then reading them back again?
I implemented an integration test for the unsigned type roundtrip, but discovered that ParquetWriter also requires modification to handle unsigned data conversion. The issue stems from a type mismatch between schema and data: schema conversion (arrow_schema_to_schema) transforms the metadata but leaves the actual data unchanged, so Arrow validation fails on write due to this mismatch.
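For context, a minimal standalone sketch (not part of this PR) of how batch data could be upcast to match the converted schema with arrow-rs's cast kernel; the field name and batch shape are illustrative:

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, RecordBatch, UInt32Array};
use arrow_cast::cast;
use arrow_schema::{DataType, Field, Schema};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Unsigned source data, including the maximum value.
    let unsigned: ArrayRef = Arc::new(UInt32Array::from(vec![0, 1, u32::MAX]));

    // UInt32 -> Int64 is lossless: every u32 value fits in an i64.
    let signed = cast(&unsigned, &DataType::Int64)?;

    // The batch now matches the signed schema produced by schema conversion,
    // so the writer's validation no longer sees a type mismatch.
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(schema, vec![signed])?;
    assert_eq!(batch.column(0).data_type(), &DataType::Int64);
    Ok(())
}
```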
I think writing record batches that contain unsigned types is out of the scope of the original issue and can be tricky:
- ParquetWriter uses AsyncArrowWriter under the hood
- AsyncArrowWriter uses an Arrow schema that was converted from the Iceberg table schema
- When converting an Iceberg schema to an Arrow schema, the Arrow schema won't have any unsigned types (see the sketch after this list), and I don't think it makes sense to add them unless there is a valid use case
- Given the schema mismatch between the record batches and the Arrow schema, the Arrow writer will fail
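To illustrate the third point, a hedged sketch assuming the crate's public schema_to_arrow_schema helper: Iceberg schemas only have signed integer primitives (int, long), so the converted Arrow schema can never contain UInt8/16/32/64.

```rust
use iceberg::arrow::schema_to_arrow_schema;
use iceberg::spec::{NestedField, PrimitiveType, Schema, Type};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Build an Iceberg schema; its integer types are all signed.
    let schema = Schema::builder()
        .with_fields(vec![
            NestedField::required(1, "a", Type::Primitive(PrimitiveType::Int)).into(),
            NestedField::required(2, "b", Type::Primitive(PrimitiveType::Long)).into(),
        ])
        .build()?;

    // The resulting Arrow schema therefore has only signed integer columns.
    let arrow_schema = schema_to_arrow_schema(&schema)?;
    println!("{arrow_schema:?}");
    Ok(())
}
```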
Thanks, from the original issue it seems the scope is ambiguous. It seems like this change makes it possible to create a schema from Arrow with unsigned types, which might be helpful by itself, but I imagine the next thing the user would want to do is actually write the data?
It seems fine to check this in separately as long as there is a clean failure for the unsigned types (i.e. we don't silently lose data).
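One way to get that clean failure would be a pre-write guard along these lines; this is a hypothetical sketch, not code from the PR:

```rust
use arrow_array::RecordBatch;
use arrow_schema::DataType;

/// Reject batches containing unsigned columns up front, so the failure is a
/// clear error rather than a schema mismatch deep inside the Arrow writer.
fn reject_unsigned_columns(batch: &RecordBatch) -> Result<(), String> {
    let schema = batch.schema();
    for field in schema.fields() {
        if matches!(
            field.data_type(),
            DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64
        ) {
            return Err(format!(
                "column '{}' has unsigned type {}; cast it to a signed type before writing",
                field.name(),
                field.data_type()
            ));
        }
    }
    Ok(())
}
```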
A few more comments. The additional test coverage would be my primary concern; the rest are mostly style nits.
Seems like we should figure out if this actually closes out the original issue or if we should keep it open for write support, but with my limited knowledge these changes seem reasonable.
Which issue does this PR close?
What changes are included in this PR?
Bug Fixes
Features
Code Changes
Files Modified
crates/iceberg/src/arrow/schema.rs
Impact
✅ No breaking changes - existing functionality preserved
✅ Safe type casting prevents overflow issues
✅ Clear error messages for unsupported UInt64 with alternatives
✅ Follows proven PyIceberg implementation approach
Are these changes tested?